Abstract: To reduce computational complexity and delay in
randomized network coded content distribution (and for some
other practical reasons), coding is not performed simultaneously
over all content blocks but over much smaller subsets known
as generations. The penalty is a reduction in throughput. We model
coding over generations as the coupon collector's brotherhood 
problem. This model enables us to theoretically compute the 
expected number of coded packets needed for successful decoding 
of the entire content, as well as a bound on the probability of 
decoding failure, and further, to quantify the tradeoff between 
computational complexity and throughput. Interestingly, with a 
moderate increase in the generation size, throughput quickly 
approaches link capacity. As an additional contribution, we derive 
new results for the generalized collector's brotherhood problem 
which can also be used for further study of many other aspects 
of coding over generations. 

I. Introduction 

In P2P systems, such as BitTorrent, content distribution 
involves fragmenting the content at its source, and using 
swarming techniques to disseminate the fragments among 
peers. Acquiring a file by collecting its fragments can, to a
certain extent, be modeled by the classic coupon collector
problem, which indicates some problems such systems may have.
For example, probability of acquiring a novel fragment drops 
rapidly as the number of those already collected increases. In 
addition, as the number of peers increases, it becomes harder 
to do optimal scheduling of distributing fragments to receivers. 
One possible solution is to use a heuristic that prioritizes 
exchanges of locally rarest fragments. But, when peers have 
only local information about the network, such fragments often 
fail to match those that are globally rarest. The consequences 
are, among others, slower downloads and stalled transfers. 
Consider a file at a server that consists of N packets, and a client
that chooses a packet from the server uniformly at random
with replacement. Then the process of downloading the file
is the classical coupon collection, and the expected number
of samplings needed to acquire all N coupons is $O(N \log N)$
(see for example [1]), which does not scale well for large N
(large files).
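
The scaling described above can be checked with a few lines of Python. The sketch below is illustrative only (the function names `expected_draws` and `novelty_probability` are not from the original): it evaluates the exact expectation $N H_N$, where $H_N$ is the $N$th harmonic number, and the probability that the next draw is novel once a given number of distinct packets is already held.

```python
from fractions import Fraction
import math


def expected_draws(num_packets: int) -> float:
    """Exact expected number of uniform draws (with replacement)
    needed to see all num_packets distinct packets: N * H_N."""
    harmonic = sum(Fraction(1, k) for k in range(1, num_packets + 1))
    return float(num_packets * harmonic)


def novelty_probability(num_packets: int, already_held: int) -> float:
    """Probability that the next uniform draw is a packet not yet held."""
    return (num_packets - already_held) / num_packets


if __name__ == "__main__":
    # Compare the exact expectation with the N log N growth rate.
    for n in (10, 100, 1000):
        print(n, round(expected_draws(n), 1), round(n * math.log(n), 1))
    # The chance of a novel packet shrinks as the collection fills up.
    print(novelty_probability(1000, 990))
```

The novelty probability falls linearly in the number of packets already held, which is why the last few packets dominate the download time.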

Randomized coding systems, such as Microsoft's Avalanche
[2], attempt to lessen such problems in the following way.
Instead of distributing the original file fragments, peers
produce linear combinations of the fragments they already hold.
These combinations are distributed together with a tag that 
describes the coefficients used for the combination. When 
a peer has enough linearly independent combinations of the 
original fragments, it can decode and build up the original file.
The information packets can be decoded when N linearly
independent equations have been collected. For a large field, this
strategy reduces the expected number of required samplings
from $O(N \log N)$ to almost N (see for example [3]). However,
the price to pay is computational complexity. If the
information packets consist of d symbols in GF(q), it takes
$O(Nd)$ operations in GF(q) to form a linear combination per
coded packet, and $O(N^3 + N^2 d)$ operations, or, equivalently,
$O(N^2 + Nd)$ operations per information packet, to decode the
information packets by solving linear equations.

We refer to the number of information packets combined 
in a coded packet as the degree of the coded packet. The 
complexity of computations required for solving the equations 
of information packets depends on the average degree of coded 
packets. To reduce computational complexity while maintaining
the throughput gain brought by coding, several approaches
have been introduced seeking to decrease the average degree
of coded packets [4], [5], [6]. Nevertheless, it is hard to design
distributed coding schemes with good throughput/complexity 
tradeoff that further combine coded packets. 

Chou et al. [7] proposed to partition information packets into
disjoint generations, and combine only packets from the same
generation. The performance of codes with random scheduling
of disjoint generations was first theoretically analyzed by
Maymounkov et al. in [8], who referred to them as chunked
codes. Chunked codes allow convenient encoding at intermediate
nodes, and are readily suitable for peer-to-peer file
dissemination. In [8], the authors used an adversarial schedule
as the network model and measured the code performance by 
estimating the loss of "code density" as packets are combined 
at intermediate nodes throughout the network. 

In this work, we propose a way to analyze coding with 
generations from a coupon collection perspective. Here, we 
view the generations as coupons, and model the receiver who 
needs to acquire multiple linear equations for each generation 
as a collector seeking to collect multiple copies of the same 
coupon. This collecting model is sometimes referred to as
the collector's brotherhood problem, as in [9]. As a classical
probability model which studies random sampling of a population
of distinct elements (coupons) with replacement [1], the
coupon collector's problem finds application in a wide range of
fields [10], from testing biological cultures for contamination
[11] to probabilistic-packet-marking (PPM) schemes for the IP
traceback problem in the Internet [12]. We here use the ideas
from the collector's brotherhood problem to derive refined
results for expected throughput for finite information length
in unicast scenarios. We describe the tradeoff in throughput
versus computational complexity of coding over generations.
Our results include the asymptotic code performance, obtained by
proportionally increasing either the generation size or the
number of generations. The concentration of the code performance
around its mean leads us to a lower bound for the probability of
decoding error. Our paper is organized as follows: Sec. II
describes the unicast file distribution with random scheduling
and introduces pertinent results under the coupon collector's
model. Sec. III studies the throughput performance of coding
with disjoint generations, including the expected performance
and the probability of decoding failure. Sec. IV concludes.

II. Coding over Generations 

A. Coding over Generations in Unicast 

We consider the transmission of a file from a source to 
a receiver over a unicast link using network coding over 
generations. 

The file is divided into N information packets,
$p_1, p_2, \dots, p_N$. Each packet $p_i$ ($i = 1, 2, \dots, N$) is
represented as a column vector of d information symbols in Galois
field GF(q). Packets are sequentially partitioned into $n = N/h$
generations of size h, denoted as $G_1, G_2, \dots, G_n$, where
$G_j = [p_{(j-1)h+1}, p_{(j-1)h+2}, \dots, p_{jh}]$ for $j = 1, 2, \dots, n$.
The receiver collects coded packets from the n generations. The
coding scheme is described as follows:

a) Encoding: In each transmission, the source first selects
one of the n generations with equal probability. Assume
$G_j$ is chosen. Then the source chooses a coding vector $e$ of
length h, with each entry chosen independently and uniformly
at random from GF(q). A new packet $p$ is then formed by
linearly combining packets from $G_j$ by $e$: $p = G_j e$. The
coded packet $p$ is then sent over the communication link to
the receiver along with the coding vector $e$ and the generation
index j.

b) Decoding: The receiver gathers coded packets and 
their coding vectors. We say that a receiver collects one more 
degree of freedom for a generation if it receives a coded packet 
from that generation with a coding vector linearly independent 
of previously received coding vectors from that generation. 
Once the receiver has collected h degrees of freedom for a 
certain generation, it can decode that generation by solving a 
system of h linear equations in h unknowns over GF(q). 

Note that the two extremes, generation sizes h = N and
h = 1, correspond respectively to full random linear network
coding (i.e., without generations) and not using coding at all (but
with random piece selection).

Since in the coding scheme described above, both the coding 
vector and the generation whose packets are combined are 
chosen uniformly at random, the code is inherently rateless. 
We measure the throughput by the number of coded packets 
necessary for decoding all the information packets. 
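
The encoding and decoding rules above can be simulated by tracking only the coding vectors, since decodability of a generation depends on their rank and not on the packet payloads. The following is a minimal sketch under stated assumptions: arithmetic is done modulo a prime `p` (so it models GF(p), not an extension field such as GF(256)), and the names `add_to_basis` and `packets_until_decoded` are hypothetical helpers introduced for illustration.

```python
import random


def add_to_basis(basis, vector, p):
    """Reduce vector against an echelon basis over GF(p), p prime.
    basis maps a pivot column to a row whose first nonzero entry is a 1
    in that column. Returns True if a new degree of freedom was gained."""
    vec = [x % p for x in vector]
    for col in range(len(vec)):
        if vec[col] == 0:
            continue
        if col in basis:
            factor = vec[col]
            vec = [(a - factor * b) % p for a, b in zip(vec, basis[col])]
        else:
            inv = pow(vec[col], p - 2, p)  # inverse via Fermat's little theorem
            basis[col] = [(a * inv) % p for a in vec]
            return True
    return False  # vector was linearly dependent on the basis


def packets_until_decoded(n, h, p, rng):
    """Count coded packets received until all n generations of size h
    have h degrees of freedom (uniform generation and vector choice)."""
    bases = [dict() for _ in range(n)]
    decoded = 0
    received = 0
    while decoded < n:
        j = rng.randrange(n)                        # generation index
        e = [rng.randrange(p) for _ in range(h)]    # coding vector
        received += 1
        if len(bases[j]) < h and add_to_basis(bases[j], e, p):
            if len(bases[j]) == h:
                decoded += 1
    return received


if __name__ == "__main__":
    rng = random.Random(0)
    trials = [packets_until_decoded(10, 10, 257, rng) for _ in range(200)]
    print(sum(trials) / len(trials))  # empirical average for N = 100
```

Averaging `packets_until_decoded` over many trials gives an empirical counterpart to the throughput measure defined above.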

c) Computational Complexity: It takes $O(hd)$ operations
to form each linear combination of h length-d vectors (packets)
in GF(q). The computational cost for encoding is then $O(hd)$
per coded packet. Meanwhile, it takes $O(h^3 + h^2 d)$ operations
in GF(q) to solve h linearly independent equations for the
h packets of one generation. Thus, the cost for decoding is
$O(h^2 + hd)$ per information packet.

B. Extension to a General Network 

In a general network, intermediate nodes follow a similar 
coding scheme except that previously received coded packets 
are combined instead of the original information packets. We 
leave out the details since the analysis of code performance in 
general network topologies is beyond the scope of this paper. 

C. Collector's Problem and Coding Over Generations 

In a coupon collector's problem, a set of distinct coupons
is sampled with replacement. We regard the n generations
as n distinct coupons, so that collecting degrees of freedom
for generation $G_i$ is analogous to collecting copies of the
$i$th element in the coupon set. In the next section, we will
characterize the throughput performance of the code under the 
coupon collector's probability model. 

III. Throughput of Coding over Generations 

We next study the throughput performance of coding over
generations in the simplest single-server-single-user scenario
by using the collector's brotherhood model [9]. Recall that,
to successfully recover the file, the receiver has to collect h
degrees of freedom for each generation. For $i = 1, \dots, n$, let
$N_i$ be the number of coded packets sampled from $G_i$ until h
degrees of freedom are collected. Then, the $N_i$'s are i.i.d. random
variables with the expected value [3]

$$E[N_i] = \sum_{k=0}^{h-1} \frac{1}{1 - q^{k-h}}. \tag{1}$$

Note that $N_i \le s$ means the $h \times s$ matrix formed by s
coding vectors of length h as columns is of full row rank.
For $s < h$, $\Pr[N_i \le s] = 0$. For $s \ge h$, $\Pr[N_i \le s]$ equals
the probability that for each $k = 1, 2, \dots, h$, the $k$th row in
the matrix is linearly independent of rows 1 through $(k - 1)$.
Hence,

$$\Pr[N_i \le s] = \prod_{k=0}^{h-1} \frac{q^s - q^k}{q^s} = \prod_{k=0}^{h-1} \left(1 - q^{k-s}\right). \tag{2}$$
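
As a quick numerical companion to the distribution of $N_i$, the sketch below evaluates the product formula for $\Pr[N_i \le s]$ and the mean of $N_i$. The mean is computed as a sum of geometric waiting times, one per rank increment, which is an assumption of this sketch that follows from the same independence argument; the function names are illustrative and not from the original.

```python
def prob_decodable_within(q, h, s):
    """Pr[N_i <= s]: probability that s random coding vectors of length h
    over GF(q) span the full h-dimensional space."""
    if s < h:
        return 0.0
    prob = 1.0
    for k in range(h):
        prob *= 1.0 - q ** (k - s)
    return prob


def expected_packets_per_generation(q, h):
    """E[N_i] as a sum of geometric waiting times: with k degrees of
    freedom collected, a new vector is useful with prob. 1 - q^(k-h)."""
    return sum(1.0 / (1.0 - q ** (k - h)) for k in range(h))


if __name__ == "__main__":
    for q in (2, 16, 256):
        h = 20
        print(q, expected_packets_per_generation(q, h) - h,
              prob_decodable_within(q, h, h))
```

For growing field size `q`, the excess of the mean over `h` shrinks and the probability of decoding with exactly `h` packets approaches one, in line with the statement that $N_i$ concentrates near h for large q.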

We have the following Lemma 1 upper bounding the complementary
cumulative distribution function (CCDF) of $N_i$.

Lemma 1: There exist positive constants $\alpha_{q,h}$ and $\alpha_{2,\infty}$
such that, for $s \ge h$,

$$\Pr[N_i > s] = 1 - \prod_{k=0}^{h-1} \left(1 - q^{k-s}\right) \le 1 - \exp\left(-\alpha_{q,h}\, q^{-(s-h)}\right) < 1 - \exp\left(-\alpha_{2,\infty}\, q^{-(s-h)}\right).$$

Proof: Please refer to Appendix. ■
Let $T(n, m)$ be the number of coded packets collected when
for the first time there are at least $m (\ge 1)$ coded packets from
every generation in the collection at the receiver. The total
number of coded packets needed for accumulating $N_i (\ge h)$
coded packets of each generation $G_i$ is then greater than or equal
to $T(n, h)$.

The collector's brotherhood problem [9], also referred to as
the double dixie cup problem [13], investigates the stochastic
quantities associated with acquiring $m (\ge 1)$ complete sets of n
distinct elements by random sampling.

A. Results From The Collector's Brotherhood Problem 
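For any $m \in \mathbb{N}$, we define $S_m(x)$ as follows:

$$S_m(x) = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \cdots + \frac{x^{m-1}}{(m-1)!} \quad (m \ge 1), \tag{3}$$

$$S_\infty(x) = \exp(x) \quad \text{and} \quad S_0(x) = 0. \tag{4}$$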

Theorem 2: Consider uniformly random sampling of n distinct
coupons with replacement. Suppose for some $A \in \mathbb{N}$,
integers $k_1, \dots, k_A$ and $m_1, \dots, m_A$ satisfy
$1 \le k_1 < \cdots < k_A \le n$ and $m_1 > \cdots > m_A \ge 1$. For
convenience of notation, let $m_0 = \infty$ and $m_{A+1} = 0$. Then, the
expected number of samplings needed to acquire at least $m_1$ copies of
at least $k_1$ coupons, at least $m_2$ copies of at least $k_2$ coupons,
and so on, at least $m_A$ copies of at least $k_A$ coupons in the
collection is

$$n \int_0^\infty \Bigg\{ 1 - \sum_{\substack{(i_0, i_1, \dots, i_{A+1}):\\ i_0 = 0,\ i_{A+1} = n,\\ k_j \le i_j \le i_{j+1},\ j = 1, \dots, A}} \prod_{j=0}^{A} \binom{i_{j+1}}{i_j} \Big[ \big(S_{m_j}(x) - S_{m_{j+1}}(x)\big) e^{-x} \Big]^{i_{j+1} - i_j} \Bigg\}\, dx. \tag{5}$$

Proof: Our proof generalizes the symbolic method of
[13]. Please refer to Appendix. ■
Setting $A = 1$, $k_1 = k$ and $m_1 = m$ in Theorem 2 gives
the following corollary:

Corollary 3: The expected number of samplings needed to
collect at least k of the n distinct coupons at least m times
each is

$$n \int_0^\infty \sum_{i=0}^{k-1} \binom{n}{i} \big(1 - S_m(x) e^{-x}\big)^{i} \big(S_m(x) e^{-x}\big)^{n-i}\, dx.$$

In the context of coding over generations, this corollary
gives us an estimate of the growth of the size of decodable
information. It is also helpful to the study of a coding scheme
with a "precode" as discussed in [8].

Furthermore, by setting $k = n$ in Corollary 3 we recover
the following result of [13]:

Corollary 4: The expected sampling size to acquire m
complete sets of n coupons is

$$E[T(n, m)] = n \int_0^\infty \Big[ 1 - \big(1 - S_m(x) e^{-x}\big)^n \Big]\, dx. \tag{6}$$

(6) can be numerically evaluated for finite m and n.

The asymptotics of $E[T(n, m)]$ for large n have been
discussed in the literature, such as [13], [14] and [15].

Theorem 5: ([13]) When $n \to \infty$,

$$E[T(n, m)] = n \log n + (m - 1)\, n \log\log n + C_m n + o(n),$$

where $C_m = \gamma - \log (m-1)!$, $\gamma$ is Euler's constant and $m \in \mathbb{N}$.
For $m \gg 1$, on the other hand, we have [13]

$$E[T(n, m)] \to nm. \tag{7}$$

What is worth mentioning is that, as the number of coupons
$n \to \infty$, for the first complete set of coupons, the number
of samplings needed is $O(n \log n)$, while the additional
number of samplings needed for each additional set is only
$O(n \log\log n)$.

In addition to the expected value of $T(n, m)$, the concentration
of $T(n, m)$ around its mean is also of great interest
to us. We can derive from it an estimate of the probability
of successful decoding after gathering a certain number of
coded packets. The generating function of $T(n, m)$ and its
probability distribution are given in [9], but it is quite difficult
to evaluate them numerically. We will instead look at the
asymptotic case where the number of coupons $n \to \infty$. Erdős
and Rényi have proven in [16] the limit law of $T(n, m)$ as
$n \to \infty$. Here we restate Lemma B from [14] by Flatto, which
in addition expresses the rate of convergence to the limit law.

Lemma 6: ([14]) Let

$$Y(n, m) = \frac{1}{n} \big( T(n, m) - n \log n - (m - 1)\, n \log\log n \big).$$

Then

$$\Pr[Y(n, m) \le y] = \exp\left( -\frac{e^{-y}}{(m-1)!} \right) + O\left( \frac{\log\log n}{\log n} \right).$$

Remark 1: (Remark 2, [13]) The estimation in Lemma 6 is
understood to hold uniformly on any finite interval $-a \le y \le a$,
i.e., for any $a > 0$,

$$\left| \Pr[Y(n, m) \le y] - \exp\left( -\frac{e^{-y}}{(m-1)!} \right) \right| \le C(m, a)\, \frac{\log\log n}{\log n},$$

$n \ge 2$ and $-a \le y \le a$. $C(m, a)$ is a positive constant
depending on m and a, but independent of n.
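
The integral expression for $E[T(n, m)]$ in Corollary 4 lends itself to direct numerical evaluation. The sketch below is one possible way to do this and is not the authors' implementation: it uses composite Simpson integration with a hypothetical step size and truncation tolerance, and writes $S_m(x) e^{-x}$ as the probability that a Poisson variable with mean x is below m. It also includes the large-n expression of Theorem 5 for comparison. The truncation rule is adequate for moderate m only.

```python
import math

EULER_GAMMA = 0.5772156649015329


def poisson_cdf_below(m, x):
    """Return S_m(x) * exp(-x), i.e. P[Poisson(x) < m]."""
    term = math.exp(-x)          # x^0 / 0! * e^{-x}
    total = 0.0
    for k in range(m):
        total += term
        term *= x / (k + 1)      # next term x^{k+1} / (k+1)! * e^{-x}
    return total


def integrand(n, m, x):
    """Integrand of Corollary 4: 1 - (1 - S_m(x) e^{-x})^n."""
    return 1.0 - (1.0 - poisson_cdf_below(m, x)) ** n


def expected_samples(n, m, step=0.01, tol=1e-12):
    """Approximate E[T(n, m)] by Simpson's rule on [0, upper], where
    upper is grown until the (decreasing) integrand drops below tol."""
    upper = float(m)
    while integrand(n, m, upper) > tol:
        upper += 1.0
    intervals = 2 * math.ceil(upper / (2.0 * step))   # even count
    width = upper / intervals
    total = integrand(n, m, 0.0) + integrand(n, m, upper)
    for i in range(1, intervals):
        total += (4.0 if i % 2 else 2.0) * integrand(n, m, i * width)
    return n * total * width / 3.0


def asymptotic_large_n(n, m):
    """Theorem 5 without the o(n) term; lgamma(m) = log((m-1)!)."""
    return (n * math.log(n) + (m - 1) * n * math.log(math.log(n))
            + (EULER_GAMMA - math.lgamma(m)) * n)


if __name__ == "__main__":
    total_packets = 1000
    for h in (1, 10, 50, 100):
        n = total_packets // h
        print(h, expected_samples(n, h), asymptotic_large_n(n, h), n * h)
```

For $m = 1$ the routine reduces to the classical coupon collector value $n H_n$, which gives a simple sanity check of the integration.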

B. Throughput 

The number of coded packets T the receiver needs to collect
for successful decoding is lower bounded by $T(n, h)$, and so
$E[T]$ is lower bounded by $E[T(n, h)]$. Also, by Lemma 1, $N_i$
is well concentrated near h for large q, and so $T(n, h)$ could
be a good estimate for T for finite n. In addition, from Theorem 5
and (7) we observe that, when $n \gg 1$ or $m \gg 1$, $E[T(n, m)]$
is linear in m. Thus, we could substitute $E[N_i]$ for m in
these expressions to roughly estimate the asymptotic expected
number of coded packets needed for successful decoding.

From Lemma 6 we obtain the following lower bound on
the probability of decoding failure as $n \to \infty$:

Theorem 7: When $n \to \infty$, the probability of decoding
failure when t coded packets have been collected is greater
than

$$1 - \exp\left[ -\frac{1}{(h-1)!}\, n (\log n)^{h-1} \exp\left( -\frac{t}{n} \right) \right] + O\left( \frac{\log\log n}{\log n} \right).$$

Proof: The probability of decoding failure after acquiring
t coded packets equals $\Pr[T > t]$. Since $T \ge T(n, h)$,

$$\Pr[T > t] \ge \Pr[T(n, h) > t] = 1 - \Pr\left[ Y(n, h) \le \frac{t}{n} - \log n - (h - 1) \log\log n \right].$$

The result in Theorem 7 follows directly from Lemma 6.
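
The leading term of the bound in Theorem 7 is easy to evaluate numerically. The sketch below (the function name is hypothetical) works in the log domain to avoid overflow in $(\log n)^{h-1}/(h-1)!$ and deliberately ignores the $O(\log\log n / \log n)$ correction, so its output is only indicative for finite n.

```python
import math


def failure_probability_lower_bound(t, n, h):
    """Leading term of the asymptotic bound:
    1 - exp(-n (log n)^(h-1) exp(-t/n) / (h-1)!), for n >= 2.
    The O(log log n / log n) correction term is not included."""
    log_rate = (math.log(n) + (h - 1) * math.log(math.log(n))
                - t / n - math.lgamma(h))   # lgamma(h) = log((h-1)!)
    if log_rate > 700.0:                    # exp would overflow; bound is ~1
        return 1.0
    return -math.expm1(-math.exp(log_rate))  # 1 - exp(-rate), computed stably


if __name__ == "__main__":
    n, h = 100, 10   # N = 1000 information packets
    for t in (1000, 1500, 2000, 2500, 3000):
        print(t, failure_probability_lower_bound(t, n, h))
```

As expected from the form of the exponent, the bound decreases monotonically in t and depends on t only through the ratio $t/n$.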
Fig. 1. (a) Estimates of T, the number of coded packets needed for
successful decoding when the total number of information packets N = 1000:
E[T(n, h)]; average of T in simulation (q = 256); $n \to \infty$ asymptotics
(Theorem 5); $m \gg 1$ asymptotics (7). (b) Estimates of probability of decoding
failure versus the number of coded packets collected: Theorem 7 along with
simulation results (q = 256).



Figure 1(a) shows several estimates of T for fixed N =
nh = 1000 versus generation size h. It is worth noting that,
for fixed N, $E[T(n, h)]$ drops significantly as h is increased
to a relatively small value. For the example N = 1000,
when each generation includes 1/10 of all the information
packets, the expected overhead required for successful
decoding is below 16%. In such scenarios as assumed in our
model, where feedback is minimal, the benefit of coding on
throughput is pronounced even when it is done over relatively
small information block lengths. In effect, communication
overhead due to the exchange of control messages makes way
for moderate computational complexity at individual nodes.
This is particularly meaningful to communication networks in
shared media, in which contention occurs frequently, and also
to networks of nodes with medium computing power.

Figure 1(b) shows the estimate of the probability of decoding
failure versus T. As pointed out in Remark 1, the deviation
of the CDF of $T(n, m)$ from the limit law for $n \to \infty$ depends
on m and is on the order of $O\left(\frac{\log\log n}{\log n}\right)$, which is quite slow.
This is also implied in our observation from Figure 1(a) that
Theorem 5 gives a good estimate of $E[T(n, m)]$ only for very
small values of m (compared to n).

IV. Conclusion And Future Work 

We investigated the throughput performance of coding over 
generations in the unicast scenario under the classical yet ever- 
useful coupon collector's model. We derived a general formula 
(Theorem 2) to compute the expected number of samplings
necessary for collecting multiple copies of n distinct coupons. 
The formula can be applied in various ways to analyze a 
number of different aspects of coding over generations. Here 
in particular, we used a special case of this result, namely, the 
expected number of samplings E[T(n, h)] needed to collect h 
complete sets of n distinct coupons to estimate the expected 
number of coded packets necessary for successful decoding 
when N information packets are encoded over n generations 
of size h each. Apart from results for finite information 
length N, the asymptotics of E[T(n, h)] can also be used to 
estimate the performance of coding when either the number of 
generations n or the generation size h goes to infinity. We also
gave a lower bound for the probability of decoding failure by
using the limit law for $T(n, h)$ as $n \to \infty$.

The general result expressed in Theorem 2 has proved
its usefulness in the analysis of coding over overlapping
generations in our recent work [17]. Further, we have been
able to derive in [18] the exact expression, as well as an upper
bound, for the expected number of coded packets necessary for 
successful decoding. We expect to extend our analysis under 
the coupon collection framework to many other aspects of 
coding over generations. 
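Appendix

Proof of Lemma 1

For $i = 1, 2, \dots, n$ and any $s \ge h$, we have

$$\ln \Pr\{N_i \le s\} = \sum_{k=0}^{h-1} \ln\left(1 - q^{k-s}\right) \ge q^{-(s-h)} \sum_{k=0}^{h-1} \ln\left(1 - q^{k-h}\right) = q^{-(s-h)} \ln \Pr\{N_i \le h\},$$

where the inequality uses the concavity of $\ln(1 - ax)$ in x.
The claim is obtained by setting

$$\alpha_{q,h} = -\ln \Pr\{N_i \le h\}, \qquad \alpha_{2,\infty} = -\lim_{h \to \infty,\ q = 2} \ln \Pr\{N_i \le h\}.$$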
Proof of Theorem 2

Our proof generalizes the symbolic method of [13]. Let $E$ be
the event that, in the acquired collection, there are at least $m_1$
copies of at least $k_1$ coupons, at least $m_2$ copies of at least $k_2$
coupons, and so on, at least $m_A$ copies of at least $k_A$ coupons.
For $t \ge 0$, let $E(t)$ be the event that $E$ has occurred after t
samplings, and let $\mathbb{1}_{\bar{E}(t)}$ be the indicator that takes value 1 if
$E(t)$ does not occur and 0 otherwise. Then random variable
$W = \mathbb{1}_{\bar{E}(0)} + \mathbb{1}_{\bar{E}(1)} + \cdots$ equals the waiting time for $E$ to
occur. For $t \ge 0$, let $\pi_t = \mathrm{Prob}[\bar{E}(t)]$; we then have

$$E[W] = \sum_{t \ge 0} \pi_t. \tag{8}$$
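
The tail-sum identity for the waiting time can be illustrated on the classical special case of collecting one copy of each of n coupons, where the probability that the collection is still incomplete after t samplings is available in closed form by inclusion-exclusion. The sketch below is an illustration under that assumption (it is not part of the proof, and the function names are hypothetical): it sums these probabilities over t and recovers $n H_n$.

```python
import math


def prob_incomplete(n, t):
    """Probability that t uniform draws with replacement have not yet
    covered all n coupons (inclusion-exclusion over missing coupons)."""
    covered = sum((-1) ** j * math.comb(n, j) * (1 - j / n) ** t
                  for j in range(n + 1))
    return 1.0 - covered


def waiting_time_by_tail_sum(n, max_draws=2000):
    """Truncated tail sum: sum over t >= 0 of Pr[event not yet occurred].
    max_draws must be large enough for the geometric tail to vanish."""
    return sum(prob_incomplete(n, t) for t in range(max_draws))


if __name__ == "__main__":
    n = 5
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    print(waiting_time_by_tail_sum(n), n * harmonic)
```

The truncation point `max_draws` is a hypothetical parameter; the neglected terms decay roughly like $n (1 - 1/n)^t$, so a few thousand terms suffice for small n.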



To derive $\pi_t$, we introduce an operator $f$ acting on an
n-variable polynomial $g$. For a monomial $x_1^{w_1} \cdots x_n^{w_n}$, let $i_j$ be
the number of exponents $w_u$ among $w_1, \dots, w_n$ satisfying
$w_u \ge m_j$, for $j = 1, \dots, A$. $f$ removes all monomials
$x_1^{w_1} \cdots x_n^{w_n}$ in $g$ satisfying $i_1 \ge k_1, \dots, i_A \ge k_A$ and
$i_1 \le \cdots \le i_A$. Note that $f$ is a linear operator, i.e., if $g_1$
and $g_2$ are two polynomials in the same n variables, and a
and b two scalars, we have $a f(g_1) + b f(g_2) = f(a g_1 + b g_2)$.

Each monomial in $(x_1 + \cdots + x_n)^t$ corresponds to one
of the $n^t$ possible outcomes after t samplings, with the
exponent of $x_i$ being the number of collected copies of the
$i$th coupon. Hence, the number of outcomes counted in $\bar{E}(t)$
equals $f((x_1 + \cdots + x_n)^t)$ evaluated at $x_1 = \cdots = x_n = 1$,
so that $\pi_t = f((x_1 + \cdots + x_n)^t) / n^t \big|_{x_1 = \cdots = x_n = 1}$,
and thus, by (8), $E[W] = \sum_{t \ge 0} \frac{f((x_1 + \cdots + x_n)^t)}{n^t} \Big|_{x_1 = \cdots = x_n = 1}$.
Making use of the identity $\frac{1}{n^t} = \frac{n}{t!} \int_0^\infty y^t e^{-ny}\, dy$, and
because of the linearity of operator $f$, we further have

$$E[W] = n \int_0^\infty \sum_{t \ge 0} \frac{f\big((x_1 + \cdots + x_n)^t\big)\, y^t}{t!}\, e^{-ny}\, dy = n \int_0^\infty f\big(\exp(x_1 y + \cdots + x_n y)\big)\, e^{-ny}\, dy \tag{9}$$

evaluated at $x_1 = \cdots = x_n = 1$.

We next need to find the sum of the monomials in the
polynomial expansion of $\exp(x_1 + \cdots + x_n)$ that should be
removed under $f$. Suppose we choose integers
$0 = i_0 \le i_1 \le \cdots \le i_A \le i_{A+1} = n$, such that $i_j \ge k_j$ for
$j = 1, \dots, A$, and then partition the indices $\{1, \dots, n\}$ into
$(A + 1)$ subsets $\mathcal{I}_1, \dots, \mathcal{I}_{A+1}$, where $\mathcal{I}_j$
($j = 1, \dots, A + 1$) has $i_j - i_{j-1}$ elements. Then

$$\prod_{j=1}^{A+1} \prod_{i \in \mathcal{I}_j} \big[ S_{m_{j-1}}(x_i) - S_{m_j}(x_i) \big] \tag{10}$$

equals the sum of all monomials in $\exp(x_1 + \cdots + x_n)$ with
$(i_j - i_{j-1})$ of the n exponents smaller than $m_{j-1}$ but greater
than or equal to $m_j$, for $j = 1, \dots, A + 1$. (Here $S$ is as defined
by (3) and (4).) The number of such partitions of $\{1, \dots, n\}$ is equal
to $\binom{n}{i_1 - i_0,\, \dots,\, i_{A+1} - i_A} = \prod_{j=0}^{A} \binom{i_{j+1}}{i_j}$. Finally, we need to sum
the terms of the form (10) over all partitions of all choices of
$i_1, \dots, i_A$ satisfying $k_j \le i_j \le i_{j+1}$ for $j = 1, \dots, A$:

$$f\big(\exp(x_1 y + \cdots + x_n y)\big) \Big|_{x_1 = \cdots = x_n = 1} = \exp(ny) - \sum_{\substack{(i_0, i_1, \dots, i_{A+1}):\\ i_0 = 0,\ i_{A+1} = n,\\ k_j \le i_j \le i_{j+1},\ j = 1, 2, \dots, A}} \prod_{j=0}^{A} \binom{i_{j+1}}{i_j} \big[ S_{m_j}(y) - S_{m_{j+1}}(y) \big]^{i_{j+1} - i_j}. \tag{11}$$

Bringing (11) into (9) gives our result in Theorem 2.



